Abstract — This paper deals with the problem of universal
lossless coding on a countable infinite alphabet. It focuses on
some classes of sources defined by an envelope condition on the
marginal distribution, namely exponentially decreasing envelope
classes with exponent $\alpha$.

The minimax redundancy of exponentially decreasing envelope
classes is proved to be equivalent to
$\frac{1}{4\alpha \log e} \log^2 n$. Then, an adaptive algorithm
is proposed, whose maximum redundancy is equivalent to the
minimax redundancy.

Index Terms — Data compression, universal coding, infinite 
countable alphabets, redundancy, Bayes mixture, adaptive com- 
pression. 



I. Introduction 

COMPRESSION of data is broadly used in our daily life: 
from the movies we watch to the office documents we 
produce. In this article, we are interested in lossless data 
compression on an unknown alphabet. This has applications 
in areas such as language modeling or lossless multimedia 
codecs. 

First, we briefly present the problem of data compression.
More details are available in general textbooks, like [1].
Then we give a short review of previous results, in which
we situate the topic of this article, exponentially decreasing
envelope classes, and we announce our results.

A. Lossless data compression 

Consider a finite or countably infinite alphabet $\mathcal{X}$. A source
on $\mathcal{X}$ is a probability distribution $P$ on the set of infinite
sequences of symbols from $\mathcal{X}$. Its marginal distributions are
denoted by $P^n$, $n \ge 1$ (for $n = 1$, we simply write $P$). The
goal of lossless data compression is to encode a sequence
of symbols $X_{1:n}$, generated according to $P^n$, into a sequence
of bits as short as possible. The code has to be uniquely
decodable.

The binary entropy $H(P^n) = \mathbb{E}_{P^n}[-\log_2 P^n(X_{1:n})]$ is
known to be a lower bound for the expected codelength of
$X_{1:n}$. From now on, $\log$ denotes the logarithm taken to base
2, while $\ln$ is used to denote the natural logarithm. Since
arithmetic coding based on $P^n$ encodes a message $x_{1:n}$ with
$\lceil -\log P^n(x_{1:n}) \rceil + 1$ bits, this lower bound can be achieved
within two bits. Then, the expected redundancy measures the
mean number of extra bits, in addition to the entropy, that a coding
strategy uses to encode $X_{1:n}$. In the sequel, we use the word
redundancy instead of expected redundancy.
Furthermore, together with the Kraft-McMillan inequality,
arithmetic coding provides an almost perfect correspondence
between coding algorithms and probability distributions on
$\mathcal{X}^n$. In this setting, if an algorithm is associated with the
probability distribution $Q^n$, its expected redundancy reduces
to the Kullback-Leibler divergence between $P^n$ and $Q^n$

$$D(P^n; Q^n) = \mathbb{E}_{P^n}\left[\log \frac{P^n(X_{1:n})}{Q^n(X_{1:n})}\right].$$

We call this quantity the (expected) redundancy of the distribution
$Q^n$ (with respect to $P^n$).
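As a small numerical illustration of this correspondence, the following sketch checks that the expected ideal codelength under a mismatched coding distribution exceeds the entropy by exactly the Kullback-Leibler divergence. The two geometric distributions, the truncation level `kmax`, and all function names are assumptions chosen for the example; they do not come from the paper.

```python
import math

def geometric_pmf(q, kmax):
    """P(k) = (1 - q) * q**(k - 1) for k = 1..kmax (the negligible tail is dropped)."""
    return [(1 - q) * q ** (k - 1) for k in range(1, kmax + 1)]

def entropy_bits(p):
    """Binary entropy H(P) = E_P[-log2 P(X)]."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def expected_ideal_length_bits(p, q):
    """Expected value of -log2 Q(X) when X is drawn from P."""
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl_bits(p, q):
    """Kullback-Leibler divergence D(P; Q) in bits."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

P = geometric_pmf(0.5, 200)   # hypothetical true source
Q = geometric_pmf(0.7, 200)   # hypothetical (mismatched) coding distribution
H = entropy_bits(P)
L = expected_ideal_length_bits(P, Q)
D = kl_bits(P, Q)
print(f"entropy={H:.4f} bits, expected length={L:.4f} bits, redundancy={D:.4f} bits")
```

The identity `L - H == D` holds up to floating-point error for any pair of distributions with common support; only the numerical values depend on the assumed parameters.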

Unfortunately, the true statistics of the source are not known
in general, but $P$ is supposed to belong to some large class of
sources $\Lambda$ (for instance, the class of all iid sources, or the class
of Markov sources). In this paper, the maximum redundancy

$$R_n(Q^n; \Lambda) = \sup_{P \in \Lambda} D(P^n; Q^n)$$

measures how well a coding probability $Q^n$ behaves on an
entire class $\Lambda$. With this point of view, the best coding
probability is a minimax coding probability, that achieves the
minimax redundancy

$$R_n(\Lambda) = \inf_{Q^n} R_n(Q^n; \Lambda).$$

Another way to measure the ability of a class of sources to
be efficiently encoded is the Bayes redundancy

$$R_{n,\mu}(\Lambda) = \inf_{Q^n} \int_\Lambda D(P^n; Q^n) \, d\mu(P),$$

where $\mu$ is a prior distribution on $\Lambda$ endowed with the topology
of weak convergence and the Borel $\sigma$-field. Only one coding
strategy achieves the Bayes redundancy: the Bayes mixture

$$M_{n,\mu}(x_{1:n}) = \int_\Lambda P^n(x_{1:n}) \, d\mu(P).$$

When $\Lambda$ is a class of iid sources on the set
$\mathcal{X} = \mathbb{N}_* = \mathbb{N} \setminus \{0\}$,
there is a natural parametrization of $\Lambda$ by $P_\theta(j) = \theta_j$, with
$\theta = (\theta_1, \theta_2, \ldots) \in \Theta_\Lambda$. $\Theta_\Lambda$ is then a subset of

$$\Theta = \Big\{\theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}_*} : \sum_{i \ge 1} \theta_i = 1\Big\},$$

and it is endowed with the topology of pointwise convergence.
In this case we write $\mu$ as a prior on $\Theta_\Lambda$.

Minimax redundancy and Bayes redundancy are linked by
an important relation [2], [3]; it is written here in the context
of iid sources on a finite or countably infinite alphabet, but
Haussler [4] has shown that it can be generalized to all classes
of stationary ergodic processes on a complete separable metric
space.

Theorem 1: Let $\Lambda$ be a class of iid sources, such that the
parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$. Let $n \ge 1$.
Then

$$R_n(\Lambda) = \sup_\mu R_{n,\mu}(\Lambda),$$

where the supremum is taken over all (Borel) probability
measures on $\Theta_\Lambda$.

The quantity $\sup_\mu R_{n,\mu}(\Lambda)$ is called maximin redundancy. A
prior whose Bayes redundancy corresponds to the maximin
redundancy is said to be maximin, or least favorable.
Theorem 1 says that maximin redundancy and minimax redundancy
are the same. It provides a tool to calculate the minimax
redundancy.

Before turning to known results, let us mention two other
notions.

From an asymptotic point of view, a sequence of coding
probabilities $(Q^n)_{n \ge 1}$ is said to be weakly universal
if the per-symbol redundancy tends to 0 on $\Lambda$:

$$\sup_{P \in \Lambda} \lim_{n \to \infty} \frac{1}{n} D(P^n; Q^n) = 0.$$

Instead of the expected redundancy, many authors consider
individual sequences. In this case, the minimax regret

$$R_n^*(\Lambda) = \inf_{Q^n} \sup_{P \in \Lambda} \sup_{x_{1:n} \in \mathcal{X}^n} \log \frac{P^n(x_{1:n})}{Q^n(x_{1:n})}$$

plays the role that the minimax redundancy plays with the
expected redundancy.

B. Exponentially decreasing envelope classes 

In the case of a finite alphabet of size $k$, many classes of
sources have been studied in the literature, for which estimates
of the redundancy have been provided. In particular we have
the class of all iid sources (see [5]-[10], and references
therein), whose minimax redundancy is

$$\frac{k-1}{2} \log \frac{n}{2\pi e} + \log \frac{\Gamma(1/2)^k}{\Gamma(k/2)} + o(1).$$

This last class can be seen as a particular case of a $(k-1)$-dimensional
class of iid sources on a (possibly) bigger alphabet,
for which we have a similar result under certain conditions
(see [11]-[13]). Similar results are still available for classes of
Markov processes and finite memory tree sources on a finite
alphabet (see [5], [14]-[16]), and for $k$-dimensional classes of
even non-iid sources on an arbitrary alphabet (see [17]).
The results become less precise when one considers infinite
dimensional classes on a finite alphabet. A typical example is
the class of renewal processes, for which we do not have an
equivalent of the expected redundancy, but we know that it is
lower and upper bounded by a constant times $\sqrt{n}$ (see [18],
[19]).

Finally, it is well known that the class of stationary ergodic
sources on a finite alphabet is weakly universal (see [1]).
However, Shields [20] showed that this class does not admit
non-trivial universal redundancy rates.

In the case of a countably infinite alphabet, the situation is
significantly different. Even the class of all iid sources is not
weakly universal (see [21], [22]). Kieffer characterized weakly
universal classes in [21] (see also [22], [23]):

Proposition 1: A class $\Lambda$ of stationary sources on $\mathbb{N}_*$ is
weakly universal if and only if there exists a probability
distribution $Q$ on $\mathbb{N}_*$ such that for every $P \in \Lambda$, $D(P; Q) < \infty$.

In the literature, we find two main ways to deal with infinite 
alphabets. The first one [24]-[32] separates the message into 
two parts: a description of the symbols appearing in the 
message, and the pattern they form. Then the compression 
of patterns is studied. 

A second approach [23], [33]-[36] studies collections of
sources satisfying Kieffer's condition, and proposes compression
algorithms for these classes. A result from [36] points in
this direction:

Proposition 2: Let $\Lambda$ be a class of iid sources over $\mathbb{N}_*$. Let
the envelope function $f$ be defined by $f(x) = \sup_{P \in \Lambda} P(x)$.
Then the minimax regret verifies

$$R_n^*(\Lambda) < \infty \iff \sum_{x \in \mathbb{N}_*} f(x) < \infty.$$

It is therefore quite natural to consider classes of iid sources 
with envelope conditions on the marginal distribution. In this 
article we study specific classes of iid sources introduced by 
[36], and called exponentially decreasing envelope classes. 

Definition 1: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. The exponentially decreasing envelope class
$\Lambda_{Ce^{-\alpha \cdot}}$ is the class of sources defined by

$$\Lambda_{Ce^{-\alpha \cdot}} = \{P : \forall k \ge 1, \ P(k) \le C e^{-\alpha k} \text{ and } P \text{ is stationary and memoryless}\}.$$

The first condition mainly addresses the tail of the distribution
of $X_1$; it means that large numbers must be rare enough.
It does not mean that the distribution is geometric: if $C$ is big
enough, many other distributions are possible. Furthermore we
will see that the exact value of $C$ does not significantly change
the minimax redundancy, unlike $\alpha$.
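To make the envelope condition concrete, here is a minimal sketch that tests whether a marginal distribution satisfies $P(k) \le C e^{-\alpha k}$ on a finite range of symbols. The parameter values (`alpha = 0.5`, `C = 3.0`), the finite check horizon `kmax`, and the example distributions are assumptions made for illustration only.

```python
import math

def envelope(k, C, alpha):
    """Envelope f(k) = min(1, C * exp(-alpha * k)) for k >= 1."""
    return min(1.0, C * math.exp(-alpha * k))

def in_envelope_class(pmf, C, alpha, kmax=500):
    """Check P(k) <= C * exp(-alpha * k) for k = 1..kmax (a finite check only)."""
    return all(pmf(k) <= C * math.exp(-alpha * k) for k in range(1, kmax + 1))

def geometric(q):
    """Geometric marginal P(k) = (1 - q) * q**(k - 1) on k >= 1."""
    return lambda k: (1 - q) * q ** (k - 1)

def two_point(k):
    """Mass 1/2 on each of the symbols 1 and 2: a non-geometric member."""
    return 0.5 if k in (1, 2) else 0.0

alpha = 0.5
C = 3.0  # hypothetical value; it satisfies C > exp(2 * alpha), about 2.718

print(in_envelope_class(geometric(0.5), C, alpha))  # tail light enough
print(in_envelope_class(geometric(0.9), C, alpha))  # tail too heavy
print(in_envelope_class(two_point, C, alpha))       # not geometric, still inside
```

With these assumed parameters the envelope equals 1 at the symbols 1 and 2, so any distribution supported on those two symbols belongs to the class, which illustrates why the class is much larger than the geometric family.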

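Since this paper deals only with exponentially decreasing
envelope classes, we simplify the notations
$R_n(Q^n; \Lambda_{Ce^{-\alpha \cdot}})$, $R_n(\Lambda_{Ce^{-\alpha \cdot}})$, and $R_{n,\mu}(\Lambda_{Ce^{-\alpha \cdot}})$
into $R_n(Q^n; C, \alpha)$, $R_n(C, \alpha)$, and $R_{n,\mu}(C, \alpha)$ respectively.
The subset of $\Theta$ corresponding to $\Lambda_{Ce^{-\alpha \cdot}}$ is denoted by

$$\Theta_{C,\alpha} = \Big\{\theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}_*} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1, \ \theta_i \le C e^{-\alpha i}\Big\}. \tag{1}$$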

We present two main results about these classes. 

In Section II we calculate the minimax redundancy of
exponentially decreasing envelope classes, and we find that it
is equivalent to $\frac{1}{4\alpha \log e} \log^2 n$ as $n$ tends to infinity. This
rate is interesting for two main reasons. To our knowledge,
exponentially decreasing envelope classes are the first family
of classes on an infinite alphabet for which an equivalent of
the minimax redundancy is known. Moreover, the rate itself is new:
until now only rates in $\log n$ or $\sqrt{n}$ have been obtained.

Once the minimax redundancy of a class of sources is
known, we are interested in finding a minimax coding algorithm.
Section III proposes a new adaptive coding algorithm,
and we show that its maximum redundancy is equivalent to
the minimax redundancy of exponentially decreasing envelope
classes.

Finally, the Appendix contains some proofs and some
auxiliary results used in the main analysis.

II. Minimax redundancy 

In this section we state our main result. Theorem 2 below 
gives an equivalent of the minimax redundancy of exponen- 
tially decreasing envelope classes. To get it, we use a result 
due to Haussler and Opper [37]. 

Theorem 2: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. The minimax redundancy of the exponentially
decreasing envelope class $\Lambda_{Ce^{-\alpha \cdot}}$ verifies

$$R_n(C, \alpha) \underset{n \to \infty}{\sim} \frac{1}{4\alpha \log e} \log^2 n.$$

Theorem 2 improves on a preceding result of [36, Theorem 7].
In that article the following bounds of the minimax
redundancy of exponentially decreasing envelope classes are
given:

$$\frac{1}{8\alpha \log e} \log^2 n \, (1 + o(1)) \le R_n(C, \alpha) \le \frac{1}{2\alpha \log e} \log^2 n + O(1).$$
In subsection II-A we outline the work done in [37],
and then we use it in subsection II-B to prove Theorem 2.
Finally, we discuss in subsection II-C the adaptation of
this method to other envelope classes.

A. From the metric entropy to the minimax redundancy 

To study the redundancy of a class of sources, [37] considers
the Hellinger distance between the first marginal distributions
of each source. Bounds on the minimax redundancy
are provided in terms of the metric entropy of the set of
the first marginal distributions, with respect to the Hellinger
distance. As a consequence, that method can be applied only
to iid sources. However it is very efficient in the case of
exponentially decreasing envelope classes.

First, we need to define the Hellinger distance and the metric
entropy. In the case of sources on a countably infinite alphabet,
the Hellinger distance can be defined in the following way:

Definition 2: Let $P$ and $Q$ be two probability distributions on
$\mathbb{N}_*$. Then the Hellinger distance between $P$ and $Q$ is defined
by

$$h(P, Q) = \sqrt{\sum_{k \ge 1} \left(\sqrt{P(k)} - \sqrt{Q(k)}\right)^2}.$$

A related metric can be defined on the parameter set $\Theta$:

$$d(\theta, \theta') = h(P_\theta, P_{\theta'}) = \sqrt{\sum_{k \ge 1} \left(\sqrt{\theta_k} - \sqrt{\theta'_k}\right)^2}.$$

From a metric we can define the metric entropy. We need 
to define first some numbers. 

Definition 3: Let $S$ be a subset of $\Theta$, and $\epsilon$ be a positive
number.

1) We denote by $\mathcal{D}_\epsilon(S, d)$ the cardinality of the smallest
finite partition of $S$ with sets of diameter at most $\epsilon$, or
we set $\mathcal{D}_\epsilon(S, d) = \infty$ if no such finite partition exists.

2) The metric entropy of $(S, d)$ is defined by

$$\mathcal{H}_\epsilon(S, d) = \ln \mathcal{D}_\epsilon(S, d).^1$$

3) An $\epsilon$-cover of $S$ is a subset $A \subset S$ such that, for all
$x$ in $S$, there is an element $y$ of $A$ with $d(x, y) \le \epsilon$.
The covering number $\mathcal{N}_\epsilon(S, d)$ is the cardinality of the
smallest finite $\epsilon$-cover of $S$, or we define $\mathcal{N}_\epsilon(S, d) = \infty$
if no finite $\epsilon$-cover exists.

4) An $\epsilon$-separated subset of $S$ is a subset $A \subset S$ such
that, for all distinct $x, y$ in $A$, $d(x, y) > \epsilon$. The packing
number $\mathcal{M}_\epsilon(S, d)$ is the cardinality of the largest finite
$\epsilon$-separated subset of $S$, or we define $\mathcal{M}_\epsilon(S, d) = \infty$ if
arbitrarily large $\epsilon$-separated subsets exist.

The following lemma explains how these numbers are 
linked. It is a classical result that can be found for instance in 
[38]. 

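Lemma 1: Let $S$ be a subset of $\Theta$. For all $\epsilon > 0$,

$$\mathcal{M}_{2\epsilon}(S, d) \le \mathcal{D}_{2\epsilon}(S, d) \le \mathcal{N}_\epsilon(S, d) \le \mathcal{M}_\epsilon(S, d).$$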

Lemma 1 enables us to choose the most convenient number 
to calculate the metric entropy. 

From the metric entropy one can define the notion of 
metric dimension, which generalizes the classical notion of 
dimension. But the metric entropy lets us know in some way 
how dense the elements are in a set, even infinite dimensional. 

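Another quantity that [37] uses is the minimax risk for the
$(1 + \lambda)$-affinity

$$R_\lambda(\Lambda) = \inf_Q \sup_{\theta \in \Theta_\Lambda} \sum_{k \ge 1} P_\theta(k)^{1+\lambda} Q(k)^{-\lambda},$$

defined for all $\lambda > 0$.

More details about the $(1 + \lambda)$-affinity are given in [37].
See also [39] for special attention paid to envelope classes.

In the case of an envelope class $\Lambda_f$ defined by an integrable
envelope function $f$, it is easy to see that $R_\lambda < \infty$ for all
$\lambda > 0$. Indeed the choice

$$Q(k) = \frac{f(k)}{\sum_{l \ge 1} f(l)}$$

leads to the relation

$$R_\lambda(\Lambda_f) \le \Big(\sum_{k \ge 1} f(k)\Big)^{\lambda} < \infty.$$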

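We can now write a slightly modified version$^2$ of Theorem 5
of [37] in the context of data compression on an infinite
alphabet.

Theorem 3: Let $\Lambda$ be a class of iid sources on $\mathbb{N}_*$, such that
the parameter set $\Theta_\Lambda$ is a measurable subset of $\Theta$. Assume
that there exists $\lambda > 0$ such that $R_\lambda(\Lambda) < \infty$. Let $h(x)$ be a
continuous, non-decreasing function defined on the positive
reals such that, for all $\gamma \ge 0$ and $C > 0$,

1) $$\lim_{x \to \infty} \frac{h\big(C x (h(x))^\gamma\big)}{h(x)} = 1$$

and

2) $$\lim_{x \to \infty} \frac{h\big(C x (\ln x)^\gamma\big)}{h(x)} = 1.$$

Then

1) If

$$\mathcal{H}_\epsilon(\Theta_\Lambda, d) \underset{\epsilon \to 0}{\sim} h\Big(\frac{1}{\epsilon}\Big),$$

then

$$R_n(\Lambda) \underset{n \to \infty}{\sim} (\log e) \, h(\sqrt{n}).^3$$

2) If, for some $\alpha > 0$ and $c > 0$,

$$\liminf_{\epsilon \to 0} \frac{\mathcal{H}_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha \, h(1/\epsilon)} \ge c,$$

then

$$\liminf_{n \to \infty} \frac{R_n(\Lambda)}{n^{\alpha/(\alpha+2)} \left[h\big(n^{1/(\alpha+2)}\big)\right]^{2/(\alpha+2)}} > 0.$$

3) If, for some $\alpha > 0$ and $C > 0$,

$$\limsup_{\epsilon \to 0} \frac{\mathcal{H}_\epsilon(\Theta_\Lambda, d)}{(1/\epsilon)^\alpha \, h(1/\epsilon)} \le C,$$

then

$$\limsup_{n \to \infty} \frac{R_n(\Lambda)}{(n \ln n)^{\alpha/(\alpha+2)} \left[h\big(n^{1/(\alpha+2)}\big)\right]^{2/(\alpha+2)}} < \infty.$$

$^1$ We follow [37] in this definition of the metric entropy. Several authors use
a slightly different definition, based on the covering number or the packing
number.

$^2$ The separation of the upper and lower bounds has no effect on the proof
given by Haussler and Opper. A complete justification is available in [39].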



The conditions concerning the function $h$ mean that $h$ cannot
grow too fast. For instance, $h$ can grow like $C (\ln x)^\beta$, with
$\beta > 0$.

The first case in the theorem is the one we use for exponentially
decreasing envelope classes. In this case, the fast
decreasing envelope produces a "not too big" metric entropy.
Theorem 3 gives us an equivalent of the minimax redundancy
of the class of sources when $n$ goes to infinity. This
turns out to be very useful, as it improves a preceding result of
[36]. However it is only an asymptotic result, without any
convergence speed.

The second and the third items correspond to bigger classes
of sources. In these cases the result is a bit less interesting: it
gives a speed for the growth of the redundancy, but without
the associated constant factor. Furthermore there is a gap of
$(\ln n)^{\alpha/(\alpha+2)}$ between the lower bound of point 2 and the
upper bound of point 3. However it allows us to retrieve more
or less a result of [36] for another type of envelope classes.

We now develop these applications.

$^3$ The $(\log e)$ factor comes from the use of the logarithm taken to base 2
in the definition of $R_n$.



B. The minimax redundancy of exponentially decreasing en- 
velope classes 

Here we want to prove Theorem 2 by applying Theorem 3. 
Thus we have to calculate the metric entropy of exponentially 
decreasing envelope classes. This is done by Proposition 3: 

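Proposition 3: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$
verifies

$$\mathcal{H}_\epsilon(\Theta_{C,\alpha}, d) = (1 + o(1)) \, \frac{1}{\alpha} \ln^2(1/\epsilon),$$

where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.

Proof of Theorem 2: Just apply Theorem 3, with
$h(x) = \frac{1}{\alpha} \ln^2(x)$, to get the result. ■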
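For the reader's convenience, the computation behind this one-line proof can be spelled out as follows. This is a worked sketch under the reading $h(x) = \frac{1}{\alpha}\ln^2 x$ of the function used in the proof, where $\alpha$ is the exponent of the envelope; the intermediate steps are ours, not quoted from the paper.

```latex
% Growth conditions of Theorem 3: for fixed \gamma \ge 0 and C > 0,
\frac{h\big(Cx(\ln x)^\gamma\big)}{h(x)}
  = \left(\frac{\ln x + \ln C + \gamma \ln\ln x}{\ln x}\right)^{2}
  \xrightarrow[x \to \infty]{} 1 ,
% and the same computation with (h(x))^\gamma in place of (\ln x)^\gamma
% only adds a term of order \ln\ln x in the numerator.

% First case of Theorem 3, using \ln n = \log n / \log e :
(\log e)\, h(\sqrt{n})
  = (\log e)\,\frac{1}{\alpha}\left(\tfrac{1}{2}\ln n\right)^{2}
  = \frac{\log e}{4\alpha}\,\frac{\log^{2} n}{\log^{2} e}
  = \frac{1}{4\alpha \log e}\,\log^{2} n .
```

The last expression is the rate announced in Theorem 2.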

Proof of Proposition 3: The outlines of the proofs of the
next lemmas can be found in Appendix A. The corollaries are
simple applications and their proofs are skipped.

We start with general considerations. Let $\Lambda_f$ be the envelope
class defined by the integrable envelope function $f$. Let $\Theta_f$
be the corresponding parameter set

$$\Theta_f = \Big\{\theta = (\theta_1, \theta_2, \ldots) \in [0,1]^{\mathbb{N}_*} : \sum_{i \ge 1} \theta_i = 1 \text{ and } \forall i \ge 1, \ \theta_i \le f(i)\Big\}.$$

The function $\theta \mapsto (\sqrt{\theta_1}, \sqrt{\theta_2}, \ldots)$ is an isometry between
the metric space $(\Theta_f, d)$ and the subset $A_f \cap \{\|x\| = 1\}$ of
$\ell^2$, equipped with the classical euclidean norm $\| \cdot \|$, where $A_f$
is defined by

$$A_f = \Big\{(x_k)_{k \in \mathbb{N}_*} : \forall k \ge 1, \ 0 \le x_k \le \sqrt{f(k)}\Big\}. \tag{2}$$

The metric entropy of $(\Theta_f, d)$ can be calculated in this space.

Next we truncate some coordinates, to work in a finite
dimensional space instead of $\ell^2$. Together with an adequate use
of Lemma 1, this helps us to obtain upper and lower bounds
for the metric entropy of $(\Theta_f, d)$. We start with the upper
bound.

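Lemma 2: Let $\Lambda_f$ be the envelope class defined by the
integrable envelope function $f$, and let $\epsilon$ be a positive number.
Let $N_\epsilon$ denote the integer

$$N_\epsilon = \inf\Big\{n \ge 1 : \sum_{k \ge n+1} f(k) \le \frac{\epsilon^2}{16}\Big\}.$$

For $U \in \mathbb{R}^N$ and $a > 0$, let $B_{\mathbb{R}^N}(U, a)$ denote the ball in
$\mathbb{R}^N$ with center $U$ and radius $a$. Then

$$\mathcal{H}_\epsilon(\Theta_f, d) \le N_\epsilon \ln(1/\epsilon) + \ln 2 + A(N_\epsilon) + B(\epsilon),$$

where

$$A(N) = -\ln \mathrm{Vol}\big(B_{\mathbb{R}^N}(0, 1)\big) = \ln \frac{\Gamma(N/2 + 1)}{\pi^{N/2}}$$

and

$$B(\epsilon) = \sum_{k=1}^{N_\epsilon} \ln\Big(\sqrt{f(k)} + \frac{\epsilon}{4}\Big).$$

Furthermore

$$A(N) \underset{N \to \infty}{\sim} \frac{N}{2} \ln N.$$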
Note that



-NJn{l/e) - 2iVJii2 < B{e) < -N^ 



These bounds on $B(\epsilon)$ show that $B(\epsilon)$ tends to decrease the
upper bound, while $A(N_\epsilon)$ contributes to its growth. If $\ln N_\epsilon$
behaves like $\ln(1/\epsilon)$ up to a constant factor, then the upper
bound given in Lemma 2 corresponds to a constant times
$N_\epsilon \ln N_\epsilon$, and we are concerned with the point 3 of Theorem 3.

If we apply Lemma 2 to the case of exponentially decreasing
envelope classes, we obtain the upper bounds

a a 1 — e " 



a 



B{e) < --Nf 



InC 



1 



2 

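This leads to

Corollary 1: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$
defined by (1) verifies

$$\mathcal{H}_\epsilon(\Theta_{C,\alpha}, d) \le (1 + o(1)) \, \frac{1}{\alpha} \ln^2(1/\epsilon),$$

where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.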

Now we need to get a lower bound on the metric entropy.
In this case too, we want to truncate some coordinates to bring
ourselves to a smaller finite dimensional space. This time we
truncate the first coordinates. Let us consider the number

$$l_f = \min\Big\{l \ge 0 : \sum_{k \ge l+1} f(k) \le 1\Big\}.$$

Lemma 3: Let $\Lambda_f$ be the envelope class defined by an
integrable envelope function $f$, which verifies

$$\sum_{k \ge 1} f(k) \ge 2.$$

Let $\epsilon > 0$ be a positive number, and let $m \ge 1$ be an integer.
Then

$$\mathcal{H}_\epsilon(\Theta_f, d) \ge \frac{1}{2} \sum_{k = l_f + 1}^{l_f + m} \ln f(k) + m \ln \frac{1}{\epsilon} + A(m),$$

where $A(m)$ is defined as in Lemma 2:

$$A(m) = -\ln \mathrm{Vol}\big(B_{\mathbb{R}^m}(0, 1)\big) \underset{m \to \infty}{\sim} \frac{m}{2} \ln m.$$

Note that exponentially decreasing envelopes verify the
condition $\sum_{k \ge 1} f(k) \ge 2$. Indeed the envelope of exponentially
decreasing envelope classes is

$$f(k) = \min(1, C e^{-\alpha k}),$$

and the condition $C > e^{2\alpha}$ entails that $f(1) = f(2) = 1$.

From Lemma 3 we can infer the following corollary, with
the choice $m = \left\lfloor \frac{2}{\alpha} \ln\left(\frac{1}{\epsilon}\right) \right\rfloor$.

Corollary 2: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. The metric entropy of the parameter set $\Theta_{C,\alpha}$
verifies

$$\mathcal{H}_\epsilon(\Theta_{C,\alpha}, d) \ge (1 + o(1)) \, \frac{1}{\alpha} \ln^2(1/\epsilon),$$

where $o(1)$ is a function $g(\epsilon)$ such that $g(\epsilon) \to 0$ as $\epsilon \to 0$.
Note that the bound is the same as in Corollary 1. Therefore
this concludes the proof of Proposition 3. ■
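The order of magnitude of the truncation level $m$ in Corollary 2 can be recovered by a short optimization. The following is a heuristic sketch, not the paper's proof: it only uses the lower bound $\ln f(k) \ge -\alpha k$, valid for $f(k) = \min(1, Ce^{-\alpha k})$ with $C > 1$, and writes $L = \ln(1/\epsilon)$.

```latex
% Leading terms of the lower bound of Lemma 3 for the exponential envelope:
\frac{1}{2}\sum_{k=l_f+1}^{l_f+m} \ln f(k) + m L
  \;\ge\; -\frac{\alpha}{2}\sum_{k=l_f+1}^{l_f+m} k + mL
  \;=\; mL - \frac{\alpha}{4}\,m^{2} - \frac{\alpha}{2}\Big(l_f + \frac{1}{2}\Big) m .

% The quadratic  m \mapsto mL - \tfrac{\alpha}{4} m^{2}  is maximal at m = 2L/\alpha :
\left. mL - \frac{\alpha}{4}\,m^{2} \right|_{m = 2L/\alpha}
  = \frac{2}{\alpha}L^{2} - \frac{1}{\alpha}L^{2}
  = \frac{1}{\alpha}\,\ln^{2}(1/\epsilon).

% The remaining terms are of lower order: the linear term is O(L), and
% A(m) \sim \tfrac{m}{2}\ln m = O(L \ln L) = o(L^{2}).
```

This is consistent with the constant $1/\alpha$ appearing in both Corollary 1 and Corollary 2.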



C. What about other envelope classes? 

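In [36] the redundancy of another type of envelope classes is
also studied. The power-law envelope class $\Lambda_{C \cdot^{-\alpha}}$ is defined,
for $C > 1$ and $\alpha > 1$, by the envelope function
$f_{\alpha,C}(x) = \min\left(1, \frac{C}{x^\alpha}\right)$. The bounds obtained in [36, Theorem 6] are

$$A(\alpha) \, n^{1/\alpha} \log \lfloor C \zeta(\alpha) \rfloor \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le \left(\frac{2Cn}{\alpha - 1}\right)^{1/\alpha} (\log n)^{1 - 1/\alpha} + O(1), \tag{3}$$

where

$$A(\alpha) = \frac{1}{\alpha} \int_1^\infty \frac{1}{u^{1 - 1/\alpha}} \left(1 - e^{-1/(\zeta(\alpha) u)}\right) du$$

and $\zeta$ denotes the classical function $\zeta(\alpha) = \sum_{k \ge 1} \frac{1}{k^\alpha}$, for
$\alpha > 1$.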

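If one adapts the calculation made earlier to the power-law
envelope classes, one can get the following upper and lower
bounds:

There are two (calculable) constants $K_1, K_2 > 0$ such that,
for all $\epsilon > 0$,

$$K_1 (1 + o(1)) \Big(\frac{1}{\epsilon}\Big)^{\frac{2}{\alpha - 1}} \le \mathcal{H}_\epsilon(\Theta_{C \cdot^{-\alpha}}, d) \le K_2 (1 + o(1)) \Big(\frac{1}{\epsilon}\Big)^{\frac{2}{\alpha - 1}} \ln\Big(\frac{1}{\epsilon}\Big).$$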

Unfortunately this formula leaves a gap between the lower
bound and the upper bound. The application of Theorem 3
makes the gap worse. Indeed the polynomial part
$(1/\epsilon)^{\frac{2}{\alpha - 1}}$ of the metric entropy causes an additional gap
of $\log^{1/\alpha} n$. In practice the bounds are the following:

There are two (unknown) constants $C, c > 0$ such that, for all
$n \ge 1$,

$$c (1 + o(1)) \, n^{1/\alpha} \le R_n(\Lambda_{C \cdot^{-\alpha}}) \le C (1 + o(1)) \, n^{1/\alpha} \log n. \tag{4}$$

These inequalities do not improve on the result of [36].
Could a better calculation of the metric entropy improve either their
lower bound or their upper bound? In any case the metric entropy
of power-law envelope classes is "too big" to efficiently apply
Theorem 3: it leaves no hope for an equivalent, as
for exponentially decreasing envelope classes. To summarize,
the strategy based on the metric entropy and Theorem 3 turns
out to be efficient for "small" classes of sources.

III. AutoCensuring Code 

This section presents a new algorithm called AutoCensuring
Code (ACcode). It is in fact a modification of the Censuring
Code proposed by Boucheron, Garivier and Gassiat in [36].
We keep the idea that big symbols are very few, and must be
encoded differently, with an Elias code. Smaller symbols are
encoded by arithmetic coding based on Krichevsky-Trofimov
mixtures, which are known to be effective for finite alphabets.
Our innovation is a data-driven cutoff $M_i = \sup_{1 \le k \le i} X_k$
used to encode $X_{i+1}$: with this choice we do not need to know
the exact parameters of the exponentially decreasing envelope.

ACcode is a prefix code on the set of all finite length
messages, and it works online. Its maximum redundancy
on an exponentially decreasing envelope class $\Lambda_{Ce^{-\alpha \cdot}}$ is
equivalent to the minimax redundancy of this class of sources.
Furthermore ACcode is adaptive, as the same algorithm
verifies this property with all exponentially decreasing envelope
classes. This is formulated in the following theorem, proved
in Appendix B.

Let $\mathrm{ACcode}(x_{1:n})$ denote the binary string produced by
ACcode when it encodes the message $x_{1:n}$, and let $\ell(\cdot)$ denote
the length of a string.

Theorem 4: For any positive numbers $C$ and $\alpha$ satisfying
$C > e^{2\alpha}$,

$$\sup_{P \in \Lambda_{Ce^{-\alpha \cdot}}} \Big(\mathbb{E}_{P^n}\big[\ell(\mathrm{ACcode}(X_{1:n}))\big] - H(P^n)\Big) \underset{n \to \infty}{\sim} R_n(C, \alpha).$$

The difference between the redundancy of ACcode and the
minimax redundancy is not necessarily bounded: there may
exist codes whose redundancy is smaller than the redundancy
of ACcode, but with a benefit negligible with respect to
$\log^2 n$.

Additionally, Theorem 4 enables us to retrieve the upper
bound of the minimax redundancy obtained in Section II.

Let us now define ACcode. Let $n \ge 1$ be some positive
integer, and let $x_{1:n} = x_1 x_2 \ldots x_n$ be a string from $\mathbb{N}_*^n$ to be
encoded. We define the sequence of maxima

$$m_0 = 0 \quad \text{and} \quad m_i = \sup_{1 \le k \le i} x_k, \ \text{for all } 1 \le i \le n.$$

The sequence $(m_i)_{1 \le i \le n}$ is non-decreasing, piecewise constant.
For $1 \le i \le n$, let $n_i^0 = \sum_{j=1}^{i} 1_{m_j > m_{j-1}}$ be the number
of plateaus between 1 and $i$. For $1 \le k \le n_n^0$, let $\tilde{m}_k$ be the
$k$-th new maximum:

$$m_i = \tilde{m}_{n_i^0}. \tag{5}$$

We also define $\tilde{m}_0 = 0$. Let the string $m$ be the sequence
$(\tilde{m}_1 - \tilde{m}_0 + 1), \ldots, (\tilde{m}_{n_n^0} - \tilde{m}_{n_n^0 - 1} + 1), 1$. $m$ is encoded
into a binary string C2 by applying Elias penultimate code
(see [33]) to each number in $m$. It is a prefix code which
uses $l_E(x)$ bits to encode a positive integer $x$, with

$$l_E(x) = 1 + \lfloor \log x \rfloor + 2 \lfloor \log(\lfloor \log x \rfloor + 1) \rfloor \quad \text{if } x \ge 2. \tag{6}$$
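The length formula (6) coincides with the length of the classical Elias delta code, so the following sketch uses that code as a stand-in for the "Elias penultimate code" of [33]. This identification is an assumption based on the length formula and on the code words visible in Fig. 1; the function names are hypothetical.

```python
def elias_delta(x):
    """Elias delta code word of a positive integer x, as a string of '0'/'1'."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    n = x.bit_length() - 1            # n = floor(log2 x)
    l = (n + 1).bit_length() - 1      # l = floor(log2(n + 1))
    # l zeros, then n + 1 in binary (l + 1 bits), then the n low-order bits of x
    return "0" * l + format(n + 1, "b") + format(x, "b")[1:]

def elias_delta_length(x):
    """Length given by formula (6): 1 + floor(log x) + 2 floor(log(floor(log x) + 1))."""
    n = x.bit_length() - 1
    return 1 + n + 2 * ((n + 1).bit_length() - 1)

def elias_delta_decode(bits, pos=0):
    """Decode one code word starting at index pos; return (value, next position)."""
    l = 0
    while bits[pos + l] == "0":
        l += 1
    n = int(bits[pos + l: pos + 2 * l + 1], 2) - 1
    start = pos + 2 * l + 1
    return int("1" + bits[start:start + n], 2), start + n

for value in (6, 3, 1):
    print(value, elias_delta(value), elias_delta_length(value))
```

Because the code is prefix-free, code words can be concatenated or interleaved with other bits and still be parsed unambiguously, which is the property ACcode relies on when it mixes the Elias code words with the arithmetic-coded string.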



Meanwhile the sequence of censored symbols is encoded using
side information from $m$. Consider the censored sequence
$\tilde{x}_{1:n} = \tilde{x}_1 \tilde{x}_2 \cdots \tilde{x}_n$ defined by

$$\tilde{x}_i = \begin{cases} x_i & \text{if } x_i \le m_{i-1}, \\ 0 & \text{otherwise.} \end{cases}$$

All symbols greater than $m_{i-1}$ are encoded together: they
are replaced by the extra symbol 0, and this extra symbol
is encoded instead. 0 has a special use in our setting: it lets
the decoder know when $m_i$ changes, and that the new value
has to be read in C2. We add at the end of $\tilde{x}_{1:n}$ an additional
0, which acts as a termination signal together with the last 1
in $m$. This makes our code a prefix code on the set of all finite
length messages (whatever $n$).
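The censoring step and its inverse can be sketched in a few lines. This is an illustrative implementation of the definitions above, with hypothetical function names; it works on lists of integers and ignores the binary encoding of the two strings.

```python
def censor(xs):
    """Split a message of positive integers into the censored sequence
    (with its terminating 0) and the list m of increments of the maximum."""
    current_max = 0
    censored, increments = [], []
    for x in xs:
        if x > current_max:
            censored.append(0)                        # new maximum: send the extra symbol 0
            increments.append(x - current_max + 1)    # always >= 2
            current_max = x
        else:
            censored.append(x)
    censored.append(0)      # termination signal in the censored sequence
    increments.append(1)    # matching final 1 in m
    return censored, increments

def uncensor(censored, increments):
    """Rebuild the original message from the censored sequence and m."""
    current_max = 0
    out = []
    inc = iter(increments)
    for symbol in censored:
        if symbol == 0:
            step = next(inc)
            if step == 1:                 # an increment of 1 can only mean termination
                break
            current_max += step - 1
            out.append(current_max)
        else:
            out.append(symbol)
    return out

print(censor([5, 3, 2, 7]))   # ([0, 3, 2, 0, 0], [6, 3, 1])
```

On the message 5, 3, 2, 7 this reproduces the censored string 0, 3, 2, 0, 0 and the string m = 6, 3, 1 used in the example of Fig. 1.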

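Therefore we produce the binary string C1 by arithmetic
coding of $\tilde{x}_{1:n}0$. The conditional coding probabilities are
defined by

$$Q_{i+1}(\tilde{X}_{i+1} = j \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}} \quad \text{if } 1 \le j \le m_i,$$

$$Q_{i+1}(\tilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{1/2}{i + \frac{m_i + 1}{2}},$$

where for $j \ge 1$ and $i \ge 0$, $n_i^j$ is the number of occurrences
of symbol $j$ in $x_{1:i}$ (with convention $n_0^j = 0$ for all $j \ge 1$).

If $i \le n - 1$, the event $\{\tilde{X}_{i+1} = 0\}$ is equal to $\{X_{i+1} > M_i\}$.
If $X_{i+1} = j > m_i$, then $n_i^j = 0$, and we still have

$$Q_{i+1}(\tilde{X}_{i+1} = 0 \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{m_i + 1}{2}}.$$

In the sequel we denote the coding probability used to encode
the entire string $\tilde{x}_{1:n}0$ by

$$Q^{n+1}(\tilde{x}_{1:n}0) = Q_{n+1}(0 \mid x_{1:n}) \prod_{i=0}^{n-1} Q_{i+1}(\tilde{x}_{i+1} \mid x_{1:i}).$$

One remark is that the symbol 0 is always
considered as new: when $X_{i+1} > m_i$, we encode 0 but we
increment the counter of the symbol $X_{i+1}$. (This choice has been made to
simplify the calculation of the redundancy of ACcode, but
we suspect that changing this behavior could improve the
performance.)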
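The coding probability of the whole censored string can be evaluated exactly with rational arithmetic. The sketch below follows the conditional probabilities of this section, with denominator $i + (m_i + 1)/2$ and numerator $n_i^j + 1/2$ (or $1/2$ for the extra symbol 0); the function name is hypothetical and the code only computes the probability, it does not perform arithmetic coding.

```python
from fractions import Fraction
import math

def c1_coding_probability(xs):
    """Exact coding probability of the censored string (terminating 0 included)
    for a message xs of positive integers."""
    counts = {}            # counts[j] = number of occurrences of j seen so far
    current_max = 0        # m_i, the maximum of the first i symbols
    prob = Fraction(1)
    for i, x in enumerate(xs):                         # i symbols already processed
        denom = Fraction(2 * i + current_max + 1, 2)   # i + (m_i + 1) / 2
        if x > current_max:
            prob *= Fraction(1, 2) / denom             # censored: extra symbol 0
            current_max = x
        else:
            prob *= (counts.get(x, 0) + Fraction(1, 2)) / denom
        counts[x] = counts.get(x, 0) + 1               # counter updated even when censored
    n = len(xs)
    prob *= Fraction(1, 2) / Fraction(2 * n + current_max + 1, 2)   # terminating 0
    return prob

q = c1_coding_probability([5, 3, 2, 7])
bits = math.ceil(-math.log2(q)) + 1    # arithmetic-coding length, ceil(-log Q) + 1
print(q, bits)
```

For the message 5, 3, 2, 7 the probability is 1/15360, which gives 15 bits, in agreement with the 15 bits quoted for the last 0 in the example of Fig. 1.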

Now that we have defined C1 and C2, we have to describe how
they are transmitted. To keep our code online, we interleave
these two strings in the following way.

Arithmetic coding needs a certain number of bits, say $l_i$, to
send the first $i$ symbols of $\tilde{x}_{1:n}0$. Unfortunately, $l_i$ depends
on whether $i = n + 1$ or not. In the former case
$l_{n+1} = \lceil -\log Q^{n+1}(\tilde{x}_{1:n}0) \rceil + 1$, and in the latter case $l_i$ depends on the
following symbols and has to be computed.
ACcode begins with C2, by the transmission of the Elias
code of $\tilde{m}_1 + 1$. Then the transmission of C1 is initiated.
Suppose that $\tilde{x}_i = 0$ and $n_i^0 = k$. As soon as $l_i$ bits of C1
have been sent, the ACcode algorithm sends the Elias code
of $\tilde{m}_k - \tilde{m}_{k-1} + 1$. Then C1 is transmitted again, from the
next bit.

To decode the $i$-th symbol in C1, the knowledge of the current
maximum $m_{i-1}$ is needed; it is obtained from the beginning of
the string C2. The decoder also needs the counters $(n_{i-1}^j)_{j \ge 1}$,
which can be computed from the first $i - 1$ decoded symbols.
As soon as $l_i$ bits of C1 have been received, $\tilde{x}_i$ can be decoded.
When the decoder meets a 0 at the $i$-th position, it knows that
the next bits are the Elias code of the next symbol in $m$, and
deduces $m_i$ via (5). Since the Elias code is prefix, the decoder
knows when it receives C1 again. Then the $(i + 1)$-th symbol
can be processed.

Fig. 1 shows an illustration of the transmission process. In
this example, the initial message is $x_{1:4} = 5, 3, 2, 7$. Then the
message encoded in C1 is $\tilde{x}_{1:4}0 = 0, 3, 2, 0, 0$. 13 bits are
needed to transmit the second 0, and 15 bits for the last one.
In C2 we transmit $m = 6, 3, 1$.

Fig. 1. Example of ACcode. The transmitted string consists, in order, of:
the Elias code of $\tilde{m}_1 + 1$, the beginning of C1, the Elias code of
$\tilde{m}_2 - \tilde{m}_1 + 1$, the last bits of C1, and the Elias code of 1.

In the previous example, exact calculations have been performed,
but this is not sensible for a practical implementation
of arithmetic coding. Some rule is needed to set the precision
of the calculations, and it must be used by both coder and decoder. To
avoid too large an extra redundancy caused by approximations,
precision can grow as $n$ grows. For instance, calculations can
be made in memory with a further precision of $2 \lceil \log i \rceil$ bits,
in addition to the $\lceil -\log Q^i(\tilde{x}_{1:i}) \rceil + 1$ bits needed to encode
$\tilde{x}_{1:i}$; this ensures that the extra redundancy is bounded.

Appendix A 

Metric entropy of exponentially decreasing 
envelope classes 

We give here the outlines of the proofs of the lemmas we stated in
subsection II-B.

Proof of Lemma 2: $N_\epsilon$ denotes the threshold from which
we want to truncate the coordinates. If $y = (y_n)_{n \ge 1}$ is an
element of $A_f$, its truncated version is $\tilde{y} = (y_n 1_{n \le N_\epsilon})_{n \ge 1}$.
One can check that

$$\|y - \tilde{y}\| \le \frac{\epsilon}{4}.$$

Suppose now that $S$ is an $\epsilon/4$-cover of $\{y \in A_f : \forall n > N_\epsilon, \ y_n = 0\}$.
Let $z$ denote an element of $A_f$. Then there exists
some $y \in S$ such that $\|\tilde{z} - y\| \le \epsilon/4$. Thus $\|z - y\| \le \epsilon/2$,
and $S$ is an $\epsilon/2$-cover of $A_f$. This leads to



,l<fc<Af, 



< 



Vol (BR«40,f)) 

A first consequence of that calculation is that $\mathcal{H}_\epsilon(\Theta_f, d)$ is
finite for all $\epsilon > 0$. The first assertion of Lemma 2 is then
obtained by applying the logarithm function.

The rest of Lemma 2 follows from the Feller bounds, in 
their version proposed by [40, ch. XII]: 



$$\Gamma(x) = \sqrt{2\pi} \, x^{x - 1/2} \, e^{-x} \, e^{\frac{\mu}{12x}}, \quad \text{with } \mu \in [0, 1]. \tag{7}$$



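Proof of Lemma 3: Let $m \ge 1$ be an integer. We project
the set $A_f \cap \{\|x\| = 1\}$ over the $m$-dimensional space

$$E_m = \{0\}^{l_f} \times \mathbb{R}^m \times \{0\}^{\{k : k \ge l_f + m + 1\}}$$

generated by the coordinates from $l_f + 1$ to $l_f + m$. This leads
to

$$\mathcal{D}_\epsilon(\Theta_f, d) \ge \mathcal{M}_\epsilon\Big(\prod_{k = l_f + 1}^{l_f + m} \big[0, \sqrt{f(k)}\big], \ \| \cdot \|_{E_m}\Big) \ge \frac{\mathrm{Vol}\Big(\prod_{k = l_f + 1}^{l_f + m} \big[0, \sqrt{f(k)}\big]\Big)}{\mathrm{Vol}\big(B_{\mathbb{R}^m}(0, \epsilon)\big)}.$$

It only remains to apply the logarithm function. ■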



Appendix B 
Redundancy of ACcode 

A. Moments of $M_n$

We first need a lemma which contains several useful results
about the moments of $M_n$.

Lemma 4: Let $C$ and $\alpha$ be positive numbers satisfying
$C > e^{2\alpha}$. Then, for all $n \ge 1$,

1) $$\sup_{P \in \Lambda_{Ce^{-\alpha \cdot}}} \mathbb{E}_P[M_n] \le \frac{1}{\alpha} \left(\ln n + \ln \frac{C}{1 - e^{-\alpha}} + 1\right).$$



2) 



3) 



sup 



A/„>J-ln-S:47; 



o 



Inn 



$$\sup_{P \in \Lambda_{Ce^{-\alpha \cdot}}} \mathbb{E}_P[M_n \ln M_n] = o(\ln^2 n).$$



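Proof: Let $F$ denote the distribution function associated
with $P$. For $t \ge 0$, we have

$$1 - F(t) = \sum_{k \ge \lfloor t \rfloor + 1} P(k) \le \frac{C e^{-\alpha(\lfloor t \rfloor + 1)}}{1 - e^{-\alpha}} \le e^{-\alpha(t - \beta)},$$

where $\beta = \frac{1}{\alpha} \ln \frac{C}{1 - e^{-\alpha}}$. Therefore $F(t) \ge G(t)$ for all $t \in \mathbb{R}$,
where

$$G(t) = 1_{t \ge \beta} \left(1 - e^{-\alpha(t - \beta)}\right).$$

We can identify in $G$ the distribution function of the random
variable $\beta + Y$, where $Y$ follows the exponential distribution
with parameter $\alpha$.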

Let $U_1, \ldots, U_n$ be $n$ iid random variables following the
uniform distribution on $[0, 1]$. For $1 \le i \le n$, let us define

$$X'_i = F^{-1}(U_i) \quad \text{and} \quad Y_i = G^{-1}(U_i) - \beta,$$

where $F^{-1}$ and $G^{-1}$ denote the pseudo-inverses of $F$ and $G$:

$$\forall t \in [0, 1], \quad F^{-1}(t) = \inf\{x \in \mathbb{R} : F(x) \ge t\}.$$

Then the $n$-dimensional vector $X'_{1:n} = (X'_1, \ldots, X'_n)$
has the same distribution as $X_{1:n}$, and the maxima
$M'_n = \sup_{1 \le i \le n} X'_i$ and $M_n$ follow the same distribution.

On the other hand, the relation $F \ge G$ entails $X'_i \le \beta + Y_i$,
for all $1 \le i \le n$. As a consequence, if $M_n^Y = \sup_{1 \le i \le n} Y_i$
denotes the maximum of all $Y_i$, we have $M'_n \le \beta + M_n^Y$.



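Since the random variables $Y_i$ are independent, the probability
distribution of $M_n^Y$ is easy to calculate. Indeed for all $t \ge 0$,

$$\mathbb{P}(M_n^Y \le t) = \mathbb{P}(\forall \, 1 \le i \le n, \ Y_i \le t) = \left(1 - e^{-\alpha t}\right)^n.$$

We can write down the density function of $M_n^Y$:

$$f(t) = \begin{cases} n \alpha e^{-\alpha t} \left(1 - e^{-\alpha t}\right)^{n-1} & \text{if } t \ge 0, \\ 0 & \text{otherwise.} \end{cases}$$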



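Now we look for an upper bound of $\mathbb{E}[M_n]$ by taking
advantage of the knowledge of that distribution:

$$\mathbb{E}[M_n] = \mathbb{E}[M'_n] \le \mathbb{E}[\beta + M_n^Y] = \beta + \int_0^\infty t \, n \alpha e^{-\alpha t} \left(1 - e^{-\alpha t}\right)^{n-1} dt = \beta + \int_0^\infty \left(1 - \left(1 - e^{-\alpha t}\right)^n\right) dt,$$

integrating by parts. Use now the change of variables
$u = 1 - e^{-\alpha t}$, that is $t = -\frac{1}{\alpha} \ln(1 - u)$:

$$\mathbb{E}[M_n] \le \beta + \frac{1}{\alpha} \int_0^1 \frac{1 - u^n}{1 - u} \, du = \beta + \frac{1}{\alpha} \sum_{k=1}^{n} \frac{1}{k} \le \frac{1}{\alpha} \left(\ln n + 1 + \ln \frac{C}{1 - e^{-\alpha}}\right).$$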

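Since the upper bound does not depend on $P$, this completes
the proof of point 1. We can handle point 2 in the
same way. For all $t \ge 0$, we have

$$\mathbb{E}\left[M_n 1_{M_n > \beta + t}\right] \le \mathbb{E}\left[(\beta + M_n^Y) 1_{M_n^Y > t}\right] \le \int_t^\infty (\beta + u) \, n \alpha e^{-\alpha u} \, du = n e^{-\alpha t} \left(t + \frac{1}{\alpha} + \beta\right).$$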



With i = |- In n, we get the second point of Lemma 4. 
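Written out, the substitution reads as follows. This is only a worked restatement of the step above, taking the bound $n e^{-\alpha t}(t + \frac{1}{\alpha} + \beta)$ and the definition $\beta = \frac{1}{\alpha}\ln\frac{C}{1-e^{-\alpha}}$ as given.

```latex
\[
t = \frac{2}{\alpha}\ln n
\;\Longrightarrow\;
n e^{-\alpha t} = n e^{-2\ln n} = \frac{1}{n},
\qquad
\beta + t = \frac{1}{\alpha}\ln\frac{C n^2}{1 - e^{-\alpha}},
\]
\[
\mathbb{E}\left[M_n \mathbb{1}_{M_n > \frac{1}{\alpha}\ln\frac{Cn^2}{1-e^{-\alpha}}}\right]
\le \frac{1}{n}\left(\frac{2}{\alpha}\ln n + \frac{1}{\alpha} + \beta\right)
= O\!\left(\frac{\ln n}{n}\right).
\]
```

The right-hand side does not depend on $P$, so the same bound holds for the supremum over the class.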

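The third item is similar. Since the function $x \mapsto x\ln x$ is 
increasing on $[1, +\infty)$ and $1 \le M'_n \le \beta + M''_n$, we have 
\[
\begin{aligned}
\mathbb{E}[M_n \ln M_n] &\le \mathbb{E}\left[(\beta + M''_n)\ln(\beta + M''_n)\right] \\
&= \mathbb{E}\left[\mathbb{1}_{M''_n \le \beta}(\beta + M''_n)\ln(\beta + M''_n)\right] \\
&\quad + \mathbb{E}\left[\mathbb{1}_{M''_n > \beta}\mathbb{1}_{M''_n \le \frac{2}{\alpha}\ln n}(\beta + M''_n)\ln(\beta + M''_n)\right] \\
&\quad + \mathbb{E}\left[\mathbb{1}_{M''_n > \beta}\mathbb{1}_{M''_n > \frac{2}{\alpha}\ln n}(\beta + M''_n)\ln(\beta + M''_n)\right] \\
&\le 2\beta\ln(2\beta) + \frac{4}{\alpha}(\ln n)\ln\left(\frac{4}{\alpha}\ln n\right) + \mathbb{E}\left[2M''_n\ln(2M''_n)\mathbb{1}_{M''_n > \frac{2}{\alpha}\ln n}\right] \\
&\le 2\beta\ln(2\beta) + \left(\frac{4}{\alpha}\ln\frac{4}{\alpha}\right)\ln n + \frac{4}{\alpha}(\ln n)(\ln\ln n) + \mathbb{E}\left[4{M''_n}^2\mathbb{1}_{M''_n > \frac{2}{\alpha}\ln n}\right].
\end{aligned}
\]
Let us define 
\[
\gamma(n) = 2\beta\ln(2\beta) + \left(\frac{4}{\alpha}\ln\frac{4}{\alpha}\right)\ln n + \frac{4}{\alpha}(\ln n)(\ln\ln n).
\]
Note that $\gamma(n) = o(\ln^2 n)$. Then 
\[
\begin{aligned}
\mathbb{E}[M_n \ln M_n] &\le \gamma(n) + \int_{\frac{2}{\alpha}\ln n}^\infty 4u^2\, n\alpha e^{-\alpha u}\, du \\
&= \gamma(n) + \frac{4n e^{-2\ln n}}{\alpha^2}\left(4\ln^2 n + 4\ln n + 2\right).
\end{aligned}
\]
Taking the supremum over $P$, we get 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_P[M_n \ln M_n] \le \gamma(n) + \frac{16\ln^2 n + 16\ln n + 8}{n\alpha^2} = o(\ln^2 n).
\]
■ 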

B. Contribution of C1 

Proposition 4: Let $C$ and $\alpha$ be positive numbers satisfying 
$C > e^{2\alpha}$. Then 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[-\log Q^n(X_{1:n}) - H(P^n)\right] \le (1 + o(1))\frac{1}{4\alpha\log e}\log^2 n.
\]

Proof: We give here a sketch of the proof, and we defer 
the proofs of (8), (9), (10), and (11). 
Here we deal with the quantity 
\[
(A) = \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[-\log Q^n(X_{1:n}) - H(P^n)\right],
\]
that corresponds to the contribution of C1. As we saw in 
Section III, the coding probability $Q^n$ is based on Krichevsky-Trofimov 
mixtures. For $k \ge 1$, let $KT_k$ denote the usual 
Krichevsky-Trofimov mixture on the alphabet $\{1, \ldots, k\}$, 
whose conditional probabilities are, for all $0 \le i \le n-1$ 
and for all $1 \le j \le k$, 
\[
KT_k(X_{i+1} = j \mid X_{1:i} = x_{1:i}) = \frac{n_i^j + \frac{1}{2}}{i + \frac{k}{2}},
\]
where $n_i^j$ denotes the number of occurrences of $j$ in $x_{1:i}$. 



Let us choose $k = m_n + 1$. In this case, there is a simple 
relation between $KT_{m_n+1}$ and $Q^n$. For any sequence of $n$ 
positive integers $x_{1:n} \in \mathbb{N}_+^n$, 
\[
Q^{i+1}(X_{i+1} = x_{i+1} \mid X_{1:i} = x_{1:i}) = \frac{2i + 1 + m_n}{2i + 1 + m_i}\, KT_{m_n+1}(X_{i+1} = x_{i+1} \mid X_{1:i} = x_{1:i}).
\]
As a consequence, we can link the redundancy of $Q^n$ to the 
redundancy of $KT_{M_n+1}$: 
\[
\log Q^n(X_{1:n}) = \log KT_{M_n+1}(X_{1:n}) + \sum_{i=0}^{n-1}\log\frac{2i + 1 + M_n}{2i + 1 + M_i},
\]
and therefore 
\[
(A) = \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\Bigg(\underbrace{\mathbb{E}_{P^n}\left[-\log KT_{M_n+1}(X_{1:n}) - H(P^n)\right]}_{(A_1)} - \underbrace{\mathbb{E}_{P^n}\left[\sum_{i=0}^{n-1}\log\frac{2i + 1 + M_n}{2i + 1 + M_i}\right]}_{(A_2)}\Bigg).
\]
Note that $(A_2)$ corresponds to the gain in redundancy of $Q^n$ 
with respect to $KT_{M_n+1}$. It illustrates the benefit of taking 
$M_i$ instead of $M_n$ as cutoff to encode $X_{i+1}$. 
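To make the gain term concrete, the sum inside $(A_2)$ can be evaluated on a given sequence. The sketch below is illustrative only: it assumes the convention $m_0 = 0$ for the maximum of an empty prefix, and the function name is not taken from the paper.

```python
import math

def kt_gain_bits(xs):
    """Sum over i = 0..n-1 of log2((2i + 1 + m_n) / (2i + 1 + m_i)).

    m_i is the maximum of the first i symbols of xs, with m_0 = 0 (assumed convention).
    """
    m_n = max(xs)
    gain, m_i = 0.0, 0
    for i, x in enumerate(xs):
        gain += math.log2((2 * i + 1 + m_n) / (2 * i + 1 + m_i))
        m_i = max(m_i, x)  # running maximum after reading symbol i + 1
    return gain
```

Every term is non-negative because $m_i \le m_n$. When the maximum appears early, as in `[3, 1, 2]`, only the first term contributes; when it appears late, as in `[1, 2, 3]`, more terms are positive and the gain is larger.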
On the one hand, we have 
\[
(A_1) \le \frac{\mathbb{E}[M_n]}{2}\log n + \mathbb{E}[\log(M_n + 1)]. \tag{8}
\]
Since $\mathbb{E}[\log(M_n + 1)] \le \mathbb{E}[M_n]$, Lemma 4 entails 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}[\log(M_n + 1)] = o(\log^2 n).
\]
Applying Lemma 4 again, we see that $(A_1)$ produces a 
redundancy equivalent to $\frac{1}{2\alpha\log e}\log^2 n$, which is twice 
the minimax redundancy obtained in Theorem 2. So we 
expect the corrective term $(A_2)$ to be about $\frac{1}{4\alpha\log e}\log^2 n$. 
To deal with $(A_2)$, we use the concavity of the log function, 
and we group the terms in the sum, $M_n$ by $M_n$. Let 
$m = \left\lfloor\frac{n-1}{M_n}\right\rfloor$ be the number of bundles. 

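To simplify the expression, we also neglect a few terms at 
the beginning of the sum. Let $(h_n)_{n \ge 1}$ be a non-decreasing 
sequence of positive integers such that $h_n \to \infty$ as $n \to \infty$, 
and let us define $\lambda_n = 2h_n\log\left(1 + \frac{1}{2h_n}\right)$. Then 
\[
(A_2) \ge \lambda_n\, \mathbb{E}_{P^n}\left[\sum_{k=h_n+1}^{m}\frac{M_n - M_{kM_n}}{2(k+1)}\right]. \tag{9}
\]
It is easy to verify that the function $x \mapsto x\log\left(1 + \frac{1}{x}\right)$ is 
non-decreasing and tends to $\log e$ when $x$ tends to infinity; 
therefore $(\lambda_n)$ tends to $\log e$. We can now write 
\[
\begin{aligned}
(A) &\le \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\left(\frac{\mathbb{E}[M_n]}{2}\log n - \lambda_n\, \mathbb{E}_{P^n}\left[\sum_{k=h_n+1}^{m}\frac{M_n - M_{kM_n}}{2(k+1)}\right]\right) + o(\log^2 n) \\
&\le \frac{1}{2}\underbrace{\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[M_n\log n - \lambda_n M_n\sum_{k=h_n+1}^{m}\frac{1}{k+1}\right]}_{(A_3)} \\
&\quad + \frac{\lambda_n}{2}\underbrace{\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[\sum_{k=h_n+1}^{m}\frac{M_{kM_n}}{k+1}\right]}_{(A_4)} + o(\log^2 n).
\end{aligned}
\]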

Let us choose $h_n = \max\{1, \lfloor\ln n - 2\rfloor\}$. Then 
\[
(A_3) = o(\log^2 n), \tag{10}
\]
\[
(A_4) \le \frac{1}{2\alpha\log^2 e}\log^2 n + o(\log^2 n). \tag{11}
\]
Therefore we have 
\[
(A) \le (1 + o(1))\frac{1}{4\alpha\log e}\log^2 n,
\]
which concludes the proof of Proposition 4. 
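The behaviour of $\lambda_n = 2h_n\log\left(1 + \frac{1}{2h_n}\right)$ used in this sketch is easy to inspect numerically. The helpers below are an illustration with hypothetical names; `math.log1p` is used only for numerical accuracy.

```python
import math

LOG2_E = math.log2(math.e)

def lam(h):
    """lambda = 2 h log2(1 + 1/(2h)) for a positive integer h."""
    return 2 * h * math.log1p(1.0 / (2 * h)) / math.log(2.0)

def h_n(n):
    """h_n = max(1, floor(ln n - 2)), the choice made in the proof sketch."""
    return max(1, math.floor(math.log(n) - 2))
```

For small $h$ the value is noticeably below $\log e \approx 1.4427$ (for example `lam(1)` is about 1.17), and it approaches $\log e$ from below as $h$ grows. Since $h_n$ only grows like $\ln n$, the convergence of $\lambda_n$ is slow in $n$.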
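Proof of (8): Let 
\[
\hat{P}^n(x_{1:n}) = \sup_P P^n(x_{1:n}) = \prod_{j \in \{x_1, \ldots, x_n\}}\left(\frac{n_n^j}{n}\right)^{n_n^j}
\]
be the maximum likelihood of the string $x_{1:n}$ over all iid 
distributions on $\mathbb{N}_+^n$, where $n_n^j$ denotes the number of 
occurrences of $j$ in $x_{1:n}$. Then 
\[
\begin{aligned}
(A_1) &\le \mathbb{E}_{P^n}\left[\log\frac{\hat{P}^n(X_{1:n})}{KT_{M_n+1}(X_{1:n})}\right] \\
&\le \mathbb{E}_{P^n}\left[\sup_{x_{1:n} \in \{1, \ldots, M_n+1\}^n}\log\frac{\hat{P}^n(x_{1:n})}{KT_{M_n+1}(x_{1:n})}\right].
\end{aligned}
\]
Now we can apply a result from Catoni [9, prop 1.4.1]: 
Lemma 5: For all $k \ge 1$ and for all $x_{1:n} \in \{1, \ldots, k\}^n$, 
\[
-\log KT_k(x_{1:n}) + \log\hat{P}^n(x_{1:n}) \le \frac{k-1}{2}\log n + \log k.
\]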
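The inequality of Lemma 5 can be checked by brute force on short sequences. The sketch below assumes the standard sequential form of the Krichevsky-Trofimov mixture, with conditional probability $(n_i^j + \frac{1}{2})/(i + \frac{k}{2})$; the function names are illustrative.

```python
import itertools
import math
from collections import Counter

def kt_log2(xs, k):
    """log2 of the Krichevsky-Trofimov probability of xs on the alphabet {1, ..., k}."""
    counts, total = Counter(), 0.0
    for i, x in enumerate(xs):
        total += math.log2((counts[x] + 0.5) / (i + k / 2.0))
        counts[x] += 1
    return total

def ml_log2(xs):
    """log2 of the maximum likelihood of xs over iid distributions."""
    n = len(xs)
    return sum(c * math.log2(c / n) for c in Counter(xs).values())

def lemma5_gap(xs, k):
    """Right-hand side minus left-hand side of Lemma 5 (non-negative if the bound holds)."""
    n = len(xs)
    lhs = -kt_log2(xs, k) + ml_log2(xs)
    return (k - 1) / 2.0 * math.log2(n) + math.log2(k) - lhs

def smallest_gap(k, n):
    """Exhaustive minimum of the gap over all sequences of length n on {1, ..., k}."""
    return min(lemma5_gap(xs, k) for xs in itertools.product(range(1, k + 1), repeat=n))
```

For $n = 1$ the gap is zero, so the bound is tight there; for longer sequences the smallest gap is reached by constant sequences in the cases enumerated here.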



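Proof of (9): We group the terms in $(A_2)$, $M_n$ by $M_n$: 
\[
(A_2) \ge \mathbb{E}_{P^n}\left[\sum_{k=1}^{m-1}\sum_{i=kM_n+1}^{(k+1)M_n}\log\left(1 + \frac{M_n - M_i}{2i + M_i + 1}\right)\right].
\]
From the relation $M_k \le M_{k'}$ for all $k' \ge k \ge 1$, we can 
infer, for all $i > kM_n$, 
\[
0 \le \frac{M_n - M_i}{2i + M_i + 1} \le \frac{M_n}{2kM_n} = \frac{1}{2k}.
\]
Since log is a concave function, we have $\log(1 + x) \ge x\,\frac{\log(1 + a)}{a}$ 
for all $a > 0$ and $0 \le x \le a$. Consequently, if 
we choose $a = \frac{1}{2k}$, 
\[
\begin{aligned}
(A_2) &\ge \mathbb{E}_{P^n}\left[\sum_{k=h_n}^{m-1}\sum_{i=kM_n+1}^{(k+1)M_n} 2k\log\left(1 + \frac{1}{2k}\right)\frac{M_n - M_i}{2i + M_i + 1}\right] \\
&\ge \lambda_n\, \mathbb{E}_{P^n}\left[\sum_{k=h_n+1}^{m}\frac{M_n - M_{kM_n}}{2k + 2}\right].
\end{aligned}
\]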



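Proof of (10): We have 
\[
(A_3) = \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[M_n\left(\log n - \lambda_n\sum_{k=h_n+1}^{m}\frac{1}{k+1}\right)\right].
\]
Then we plug in $h_n = \lfloor\ln n - 2\rfloor$. For $n$ large enough, $h_n \ge 1$, 
and we have 
\[
\begin{aligned}
\sum_{k=h_n+1}^{m}\frac{1}{k+1} &\ge \int_{h_n+2}^{\left\lfloor\frac{n-1}{M_n}\right\rfloor + 1}\frac{dx}{x} \\
&\ge \ln\frac{n-1}{M_n} - \ln(\ln n) \\
&= \ln(n-1) - \ln M_n - \ln(\ln n),
\end{aligned}
\]
and therefore 
\[
\begin{aligned}
(A_3) \le \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\Big[&(\log e - \lambda_n)\,\mathbb{E}[M_n]\ln n + \lambda_n\,\mathbb{E}[M_n]\ln\frac{n}{n-1} \\
&+ \lambda_n\,\mathbb{E}[M_n\ln M_n] + \lambda_n\,\mathbb{E}[M_n]\ln(\ln n)\Big].
\end{aligned}
\]
Then, if we use Lemma 4 and the fact that $\lambda_n$ tends to $\log e$, 
we get (10). ■ 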
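Proof of (11): We want to commute the expected value 
and the sum in $(A_4)$. To do so, we need to get rid of $m$. We can 
note that the condition $k \le m = \left\lfloor\frac{n-1}{M_n}\right\rfloor$ entails $kM_n \le n - 1$. 
Consequently, for $n$ big enough, 
\[
\begin{aligned}
(A_4) &\le \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_P\left[\sum_{k=3}^{n-1}\frac{M_{kM_n}\mathbb{1}_{kM_n \le n-1}}{k+1}\right] \\
&\le \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_P\left[\sum_{k=3}^{n-1}\frac{M_{\lfloor k l_n\rfloor} + M_n\mathbb{1}_{M_n > l_n}}{k+1}\right] \\
&\le \sum_{k=3}^{n-1}\frac{1}{k+1}\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_P\left[M_{\lfloor k l_n\rfloor}\right] + \sum_{k=3}^{n-1}\frac{1}{k+1}\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_P\left[M_n\mathbb{1}_{M_n > l_n}\right],
\end{aligned}
\]
where $l_n = \frac{1}{\alpha}\left(2\ln n + \ln\frac{C}{1 - e^{-\alpha}}\right)$. We can now plug in 
the results of Lemma 4: 
\[
\begin{aligned}
(A_4) &\le \sum_{k=3}^{n-1}\frac{\ln k + \ln l_n + 1 + \ln\frac{C}{1 - e^{-\alpha}}}{(k+1)\alpha} + o(1)\sum_{k=3}^{n-1}\frac{1}{k+1} \\
&\le \frac{1}{\alpha}\sum_{k=3}^{n-1}\frac{\ln k}{k+1} + \left(\frac{1 + \ln l_n + \ln\frac{C}{1 - e^{-\alpha}}}{\alpha} + o(1)\right)\sum_{k=3}^{n-1}\frac{1}{k+1}.
\end{aligned}
\]
Note that $l_n = O(\ln n)$, and consequently $\ln l_n = O(\ln\ln n)$. 
So 
\[
\begin{aligned}
(A_4) &\le \frac{1}{\alpha}\int_3^n\frac{\ln x}{x}\,dx + O(\ln\ln n)\int_3^n\frac{dx}{x} \\
&\le \frac{1}{2\alpha}\ln^2 n + o(\ln^2 n).
\end{aligned}
\]
■ 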
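For reference, the integral and the change of logarithm base behind the constant can be written out. This is a worked restatement under the conventions of the paper ($\log$ to base 2, $\ln$ natural), using $\ln n = \log n / \log e$ and the limit $\lambda_n \to \log e$.

```latex
\[
\frac{1}{\alpha}\int_3^n \frac{\ln x}{x}\,dx
= \frac{\ln^2 n - \ln^2 3}{2\alpha}
\le \frac{\ln^2 n}{2\alpha}
= \frac{\log^2 n}{2\alpha \log^2 e},
\]
\[
\frac{\lambda_n}{2}\cdot\frac{\log^2 n}{2\alpha\log^2 e}
= (1 + o(1))\,\frac{\log e}{2}\cdot\frac{\log^2 n}{2\alpha\log^2 e}
= (1 + o(1))\,\frac{1}{4\alpha\log e}\,\log^2 n .
\]
```

The second line is where the factor $\frac{1}{4\alpha\log e}$ of Proposition 4 comes from once the contribution of $(A_3)$ has been shown to be negligible.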



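C. Contribution of C2 

Proposition 5: Let $C$ and $\alpha$ be positive numbers satisfying 
$C > e^{2\alpha}$. Then 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}[l(C2)] \le o(\log^2 n).
\]

Proof: As in the previous subsection, we first give a 
sketch of the proof, and we defer several technical lemmas. 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}[l(C2)] \le 1 + \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\sum_{i=1}^{n}\mathbb{E}_{P^n}\left[\mathbb{1}_{X_i > M_{i-1}}\, l_E(X_i + 1)\right].
\]
We deal with this sum thanks to the following lemma: 

Lemma 6: Let $g$ be a positive and non-decreasing function 
on $[1, \infty)$. Let $(K_n)_{n \ge 1}$ be a non-decreasing sequence of 
positive integers. Then, for all $n \ge 1$, 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\sum_{i=1}^{n}\mathbb{E}_{P^n}\left[\mathbb{1}_{X_i > M_{i-1}}\, g(X_i)\right] \le (1 + o(1))\,g(K_n)\frac{1}{\alpha}\ln n + Cn\int_{K_n}^{\infty} g(x+1)e^{-\alpha x}\,dx.
\]

To apply Lemma 6, we extend the definition of $l_E$ to $[1, \infty)$ 
by 
\[
l_E(x) = \begin{cases} 1 & \text{if } x \in [1, 2), \\ 1 + \lfloor\log x\rfloor + 2\lfloor\log\lfloor\log x\rfloor + 1\rfloor & \text{if } x \ge 2. \end{cases}
\]
We get 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}[l(C2)] \le (1 + o(1))\frac{l_E(K_n + 1)}{\alpha}\ln n + Cn\int_{K_n}^{\infty} l_E(x+2)e^{-\alpha x}\,dx.
\]
Then we can choose $K_n = \max\{1, \lfloor\frac{1}{\alpha}\ln n\rfloor\}$. This entails 
$l_E(K_n + 1) \sim \log K_n \sim \log\log n = o(\log n)$, 
and therefore 
\[
\frac{1}{\alpha}\,l_E(K_n + 1)\ln n = o(\log^2 n).
\]
The remaining term is treated by Lemma 7, which completes 
the proof of Proposition 5: 

Lemma 7: Let $\alpha > 0$ be a real number, and let 
$K_n = \max\{1, \lfloor\frac{1}{\alpha}\ln n\rfloor\}$. Then 
\[
n\int_{K_n}^{\infty} l_E(x+2)e^{-\alpha x}\,dx = o(\log n).
\]
■ 

Proof of Lemma 6: Let $P$ be an element of $\Lambda_{Ce^{-\alpha\cdot}}$. Let 
us define, for all $k \ge 0$, 
\[
p(k) = P(X_1 > k) = \sum_{j \ge k+1} P(j)
\]
and 
\[
(B_1) = \sum_{i=1}^{n}\mathbb{E}_{P^n}\left[\mathbb{1}_{X_i > M_{i-1}}\, g(X_i)\right].
\]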



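Note that, for all $1 \le i \le n$, $X_i$ and $M_{i-1}$ are independent 
random variables, and 
\[
P^n(M_i \le k) = P^n(\forall\, 1 \le j \le i,\ X_j \le k) = (1 - p(k))^i.
\]
Then we can write 
\[
\begin{aligned}
(B_1) &= \sum_{i=1}^{n}\sum_{k \ge 0} P^n(M_{i-1} = k)\sum_{m \ge k+1} P(m)g(m) \\
&= \sum_{m \ge 1} P(m)g(m)\sum_{i=1}^{n}\sum_{k=0}^{m-1} P(M_{i-1} = k) \\
&= P(1)g(1) + \sum_{m \ge 2} P(m)g(m)\sum_{i=1}^{n}(1 - p(m-1))^{i-1} \\
&= \sum_{m \ge 1} P(m)g(m)\,\frac{1 - (1 - p(m-1))^n}{p(m-1)}.
\end{aligned}
\]
If we take $g(x) = 1$ for all $x$, we get 
\[
\sum_{m \ge 1} P(m)\,\frac{1 - (1 - p(m-1))^n}{p(m-1)} = \mathbb{E}\left[\sum_{i=1}^{n}\mathbb{1}_{X_i > M_{i-1}}\right] \le \mathbb{E}[M_n].
\]
In the general case, we can split the sum at $K_n$, and we get 
\[
\begin{aligned}
(B_1) &= \sum_{m=1}^{K_n} P(m)g(m)\,\frac{1 - (1 - p(m-1))^n}{p(m-1)} + \sum_{m \ge K_n+1} P(m)g(m)\,\frac{1 - (1 - p(m-1))^n}{p(m-1)} \\
&\le g(K_n)\sum_{m \ge 1} P(m)\,\frac{1 - (1 - p(m-1))^n}{p(m-1)} + \sum_{m \ge K_n+1} nP(m)g(m) \\
&\le g(K_n)\,\mathbb{E}[M_n] + Cn\sum_{m \ge K_n+1} g(m)e^{-\alpha m}.
\end{aligned}
\]
At this point, we can take the supremum over all sources $P$ 
in $\Lambda_{Ce^{-\alpha\cdot}}$: 
\[
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\sum_{i=1}^{n}\mathbb{E}_{P^n}\left[\mathbb{1}_{X_i > M_{i-1}}\, g(X_i)\right] \le (1 + o(1))\,g(K_n)\frac{1}{\alpha}\ln n + Cn\int_{K_n}^{\infty} g(x+1)e^{-\alpha x}\,dx.
\]
■ 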
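The case $g = 1$ of the identity above, which counts the expected number of times a new maximum appears, can be evaluated exactly for a simple source. The sketch below again assumes the geometric source $P(k) = (1-q)q^{k-1}$ with $q = e^{-\alpha}$, for which $p(k) = q^k$; the function names are hypothetical.

```python
import math

def expected_records_geometric(n, alpha, tol=1e-15):
    """Sum over m >= 1 of P(m) (1 - (1 - p(m-1))^n) / p(m-1) for the geometric source."""
    q = math.exp(-alpha)
    total, m = 0.0, 1
    while True:
        p_prev = q ** (m - 1)          # p(m-1) = P(X > m-1)
        prob_m = (1.0 - q) * p_prev    # P(m)
        term = prob_m * (1.0 - (1.0 - p_prev) ** n) / p_prev
        if term < tol:
            break
        total += term
        m += 1
    return total

def expected_max_geometric(n, alpha, tol=1e-15):
    """E[M_n] = sum over k >= 0 of (1 - (1 - q^k)^n) for the same source."""
    q = math.exp(-alpha)
    total, k = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** k) ** n
        if term < tol:
            break
        total += term
        k += 1
    return total
```

For this particular source the two sums differ exactly by the factor $1 - q$, so the inequality with $\mathbb{E}[M_n]$ is far from tight when $\alpha$ is small; for $n = 2$ the expected number of new maxima is $1 + q/(1+q)$.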

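Proof of Lemma 7: 
\[
\begin{aligned}
n\int_{K_n}^{\infty} l_E(x+2)e^{-\alpha x}\,dx &\le n\int_{K_n}^{\infty}\left(\log(x+2) + 2\log\log(x+3) + 1\right)e^{-\alpha x}\,dx \\
&\le n e^{-\alpha K_n}\log(K_n + 3)\int_{K_n+3}^{\infty}\frac{\log x + 2\log\log x + 1}{\log(K_n + 3)}\,e^{-\alpha(x - K_n - 3)}\,dx \\
&\le e^{\alpha}\log(K_n + 3)\left(\sup_{x \ge K_n+3}\frac{\log x + 2\log\log x + 1}{\log x}\right)\int_{K_n+3}^{\infty}\frac{\log x}{\log(K_n + 3)}\,e^{-\alpha(x - K_n - 3)}\,dx \\
&= e^{\alpha}\log(K_n + 3)\left(\sup_{x \ge K_n+3}\frac{\log x + 2\log\log x + 1}{\log x}\right)\int_{0}^{\infty}\left(1 + \frac{\log\left(1 + \frac{u}{K_n + 3}\right)}{\log(K_n + 3)}\right)e^{-\alpha u}\,du \\
&= O(\log K_n) \\
&= o(\log n).
\end{aligned}
\]
The supremum is well defined and bounded, because the 
function 
\[
x \mapsto \frac{\log x + 2\log\log x + 1}{\log x}
\]
is continuous and tends to 1 as $x$ tends to infinity. ■ 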

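D. Proof of Theorem 4 

The message sent by the ACcode algorithm is composed 
of two strings C1 and C2. C1 corresponds to the part of 
the message encoded by the arithmetic code, with coding 
probability $Q^{n+1}$. The arithmetic code encodes a message 
$x_{1:n}0$ with $\lceil -\log Q^{n+1}(x_{1:n}0)\rceil + 1$ bits. We have 
\[
\begin{aligned}
\mathbb{E}_{P^n}\left[-\log Q^{n+1}(0 \mid X_{1:n})\right] &= \mathbb{E}_{P^n}\left[\log(M_n + 1 + 2n)\right] \\
&\le \log(2n) + \frac{\mathbb{E}_{P^n}[M_n + 1]}{2n}\log e \\
&= O(\log n)
\end{aligned}
\]
thanks to Lemma 4. Therefore the redundancy of ACcode can 
be upper bounded, for all $n \ge 2$, by 
\[
\begin{aligned}
\sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}}\left(\mathbb{E}_{P^n}[l(C1) + l(C2)] - H(P^n)\right) &\le \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}\left[-\log Q^n(X_{1:n}) - H(P^n)\right] \\
&\quad + \sup_{P \in \Lambda_{Ce^{-\alpha\cdot}}} \mathbb{E}_{P^n}[l(C2)] + O(\log n).
\end{aligned}
\]
We conclude thanks to Propositions 4 and 5. 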
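The cost of the terminating symbol 0 is small compared with the $\log^2 n$ terms. The sketch below simply evaluates $\log(m_n + 1 + 2n)$ and the elementary upper bound obtained from $\log(1 + x) \le x\log e$; the function names are illustrative, and the expression for the cost is the one stated in the display above.

```python
import math

def end_symbol_cost_bits(n, m_n):
    """log2(m_n + 1 + 2n): ideal code length of the final symbol 0 after n symbols."""
    return math.log2(m_n + 1 + 2 * n)

def end_symbol_cost_upper(n, m_n):
    """log2(2n) + (m_n + 1) / (2n) * log2(e), from log2(1 + x) <= x * log2(e)."""
    return math.log2(2 * n) + (m_n + 1) / (2.0 * n) * math.log2(math.e)
```

Since Lemma 4 gives $\mathbb{E}[M_n] = O(\ln n)$, the second term of the upper bound vanishes as $n$ grows and the cost is $\log(2n) + o(1)$ bits on average.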